New York House Prices
Group L12 G03
Cifti Saggu, Daniella Jaqin, Hamza Ahmed, Zichen Liu and Mike Xu
Data Description
Introduction
Random sample of data from a county in New York
It’s sourced from the Data And Story Library (DASL).
Clean, since there are no missing values ✔.
Dependent variable → price which represents the sales price of each house
Independent variables → age, land value, living area, bathrooms, and rooms
Data Description
1. Variables
- Age (years) - The age of the house, measured from the year of construction to the present (2023 in this case).
- Land Value (USD) - The assessed value of the land.
- Living Area (sq ft) - The size of the interior living space of the house.
- Bathrooms - The number of bathrooms, counting both full bathrooms and half-baths.
- Number of Rooms - This represents the total number of rooms in the house.
Data Description
2. Categories
- Numerical Continuous: Age, Land Value, Living Area
- Numerical Discrete: Bathrooms, Number of Rooms
Appropriate model selection - Mike
Goal: Predict house prices based on several property characteristics.
How did we do this? We fitted multiple regression models and selected the best set of variables for the predictive model.
Which models did we compare? Forward stepwise selection, backward stepwise selection, and an exhaustive search.
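To illustrate the idea behind forward selection, here is a minimal sketch that starts from an intercept-only model and greedily adds the predictor that lowers AIC the most. It is not the code used for this project: the data are synthetic, the column names and coefficients are made up for illustration, and the helper names (`ols_rss`, `aic`, `forward_select`) are hypothetical.

```python
import math
import random


def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * target for row, target in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((target - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, target in zip(X, y))


def aic(rss, n, n_params):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*n_params."""
    return n * math.log(rss / n) + 2 * n_params


def forward_select(data, y):
    """Add one predictor at a time; stop when no addition lowers AIC."""
    n = len(y)
    selected = []
    best_aic = aic(ols_rss([], y), n, 1)  # intercept-only model
    while len(selected) < len(data):
        scores = {}
        for name in data:
            if name not in selected:
                cols = [data[s] for s in selected + [name]]
                scores[name] = aic(ols_rss(cols, y), n, len(cols) + 1)
        best_name = min(scores, key=scores.get)
        if scores[best_name] >= best_aic:
            break
        selected.append(best_name)
        best_aic = scores[best_name]
    return selected


# Synthetic data (assumption): two informative predictors and one unrelated one.
random.seed(0)
n = 200
data = {
    "living_area": [random.uniform(800, 3000) for _ in range(n)],
    "bathrooms": [random.choice([1, 1.5, 2, 2.5, 3]) for _ in range(n)],
    "unrelated": [random.gauss(0, 1) for _ in range(n)],
}
log_price = [11 + 0.0003 * a + 0.11 * b + random.gauss(0, 0.1)
             for a, b in zip(data["living_area"], data["bathrooms"])]

selected = forward_select(data, log_price)
print(selected)
```

Backward selection runs the same loop in reverse (start with every predictor and drop one at a time), while an exhaustive search scores every possible subset.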
Appropriate model selection
✔ These methods helped identify the most relevant property features that contribute to accurate price predictions.
✔ Multiple regression model:
Assumptions
1. Linearity
Based on the correlation plots, we chose the variables age, land value, living area, bathrooms, and rooms.
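As a rough numeric companion to a correlation plot, the Pearson correlation coefficient measures how close a relationship is to a straight line. The sketch below uses made-up values purely for illustration; they are not taken from the data set, and `pearson` is a hypothetical helper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Made-up example values (assumption), not rows from the real sample.
living_area = [900, 1200, 1500, 1800, 2400]
log_price = [11.6, 11.8, 11.9, 12.1, 12.3]

r = pearson(living_area, log_price)
print(round(r, 3))
```

A value near +1 or -1 suggests a roughly linear relationship, although a plot is still needed to rule out curvature.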
Assumptions
2. Independence
![]()
Figure 1: Residual Plots for Predictors vs Model
Assumptions
3. Homoskedasticity
![]()
Figure 2: Residual Plot of Chosen Variables
The Breusch-Pagan test checks for heteroskedasticity. Because the p-value (0.196) is greater than 0.05, we fail to reject the null hypothesis of constant variance, so there is no significant evidence of heteroskedasticity and the constant-variance assumption is reasonable.
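For reference, one common form of the Breusch-Pagan test (the studentized version) can be written as below. This is a general statement of the test, not output from our model: the slides report only the p-value, so the test statistic itself is not reproduced here, and $k$ is assumed to be the number of predictors in the auxiliary regression.

```latex
\begin{aligned}
\hat{e}_i^{\,2} &= \gamma_0 + \gamma_1 x_{i1} + \dots + \gamma_k x_{ik} + u_i \\
H_0 &: \gamma_1 = \dots = \gamma_k = 0 \quad \text{(constant variance)} \\
LM &= n \, R^2_{\text{aux}} \;\sim\; \chi^2_k \quad \text{under } H_0
\end{aligned}
```

Here $\hat{e}_i$ are the residuals of the fitted model and $R^2_{\text{aux}}$ is the R² of the auxiliary regression of the squared residuals on the predictors. A large p-value, such as 0.196, means the predictors explain little of the variation in the squared residuals.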
Assumptions
4. Normality
![]()
Figure 3: Q-Q Plot of Linear Numeric Variables
Comparing Models
Model Comparison Based on Key Criteria
| Model | MAE | RMSE | R² | AIC | BIC | Assessment |
| --- | --- | --- | --- | --- | --- | --- |
| Forward Selection | 0.21454 | 0.33068 | 0.5586 | 792.91 | 831.12 | ✔ Adapts well to new data; ✔ Straightforward; ✔ Highlights key drivers |
| Backward Selection | 0.21454 | 0.33068 | 0.5586 | 792.91 | 831.12 | ✘ Overfit risk on larger data; ✘ Complex interactions; ✘ Complex to implement |
| Exhaustive Search | 0.21702 | 0.33412 | 0.5467 | 835.16 | 862.46 | ✘ Computationally expensive; ✘ Difficult to explain; ✘ Complex and overwhelming |
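For readers unfamiliar with these criteria, the sketch below shows how each one is usually defined. The actual and predicted values are toy numbers chosen for illustration (not our model's output), and the function names are hypothetical. Given the model form, the reported MAE and RMSE are presumably on the log-price scale, although the slides do not state this.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: penalises extra parameters more as n grows."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Toy log-prices (assumption), only to show the calculations.
actual = [12.0, 12.5, 11.8, 12.2]
predicted = [12.1, 12.3, 11.9, 12.2]

print(mae(actual, predicted), rmse(actual, predicted), r_squared(actual, predicted))
```

Lower MAE, RMSE, AIC, and BIC and higher R² all point to a better model, which is how the rows of the comparison can be read.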
Model Outputs
Forward Model
![]()
Backward Model
![]()
Exhaustive Search
![]()
Why the Forward Model Is Best
- Ideal for stakeholders: it highlights the key variables that affect price and adds a new variable only when it improves accuracy.
- Avoids unnecessary, complex interactions, which makes it well suited to non-technical audiences.
- Mirrors real-world property assessment, starting from the basics (size) and moving to more specific features (rooms, bathrooms, age).
Limitations
- Nature of the data set:
Most of the initial data set was categorical, which limited the continuous predictors available to the model.
Because the data are specific to New York, the results may not apply to other states or countries.
Variables with a non-linear relationship to price cannot be included, which limits the house features that can be compared.
Future Improvements
Improve the data set: include location or economic indicators (inflation, interest rates) and add more continuous predictors for finer granularity.
Consider non-linear models (e.g., decision trees, random forests) to capture complex relationships, while keeping the results as interpretable as possible.
Conclusion
- The data used in this analysis are a random sample of houses from a county in New York.
- Our goal: to find the extent to which the numeric features of a house affect its price in this sample.
Key Findings
- We used a multiple regression workflow to derive an equation that quantifies how much each numeric feature affects the price of a house.
log(Price) = 0.0002983×(living area) + 0.000003596×(land value) + 0.1107×(bathrooms) − 0.001356×(age) + 0.009131×(rooms)
Key Findings
| Variable (one-unit increase) | Effect on price | Approximate change |
| --- | --- | --- |
| Bathrooms (number of bathrooms) | Increase | 11.07% |
| Rooms (number of rooms) | Increase | 0.91% |
| Living Area (square feet) | Increase | 0.03% |
| Age (years) | Decrease | 0.14% |
| Land Value (US dollars) | Increase | 0.0004% |
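The percentages above read each coefficient directly as a percentage change (100 × coefficient), which is a good approximation when coefficients are small. As a sketch, and assuming the model uses the natural logarithm (the slides do not say which base), the exact change for a given increase in a variable is (e^(coefficient × increase) − 1) × 100. The dictionary keys and the helper `pct_change` below are hypothetical names; the coefficients are copied from the equation.

```python
import math

# Coefficients copied from the fitted equation in the slides.
coefficients = {
    "living_area": 0.0002983,
    "land_value": 0.000003596,
    "bathrooms": 0.1107,
    "age": -0.001356,
    "rooms": 0.009131,
}


def pct_change(beta, increase=1.0):
    """Exact percentage change in price for a given increase in one variable,
    holding the others fixed, under a natural-log model (assumption)."""
    return (math.exp(beta * increase) - 1) * 100


# Exact effect of a one-unit increase in each variable.
exact = {name: pct_change(beta) for name, beta in coefficients.items()}
for name, value in exact.items():
    print(f"{name}: {value:+.4f}%")

# A one-unit change means very different things across variables:
# one square foot is tiny, so a larger step is also worth checking.
extra_500_sqft = pct_change(coefficients["living_area"], increase=500)
print(f"500 extra sq ft: {extra_500_sqft:+.2f}%")
```

For bathrooms the exact figure is slightly higher than the approximation (about 11.7% rather than 11.07%). The last line also shows that per-unit effects are not directly comparable: a realistic change in living area can matter as much as an extra bathroom.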
Why People Should Care
Understanding which features drive house prices matters to all of us as potential future homeowners, and also to investors, real estate agents, policymakers, and developers who rely on this knowledge to make decisions.
Although this data is from New York, it highlights the key features of a house that can lead to significant price differences.
Final Takeaway
Per one-unit increase, bathrooms had the biggest impact on price, followed by the number of rooms.
A house's age, land value, and living area each had a small per-unit impact, with age being the only variable associated with a decrease in price.
Thank you for your attention. We hope this helps with your home-buying decisions!